Abstract

Divergence from a random baseline is a technique for the
evaluation of document clustering. It ensures that cluster
quality measures reward only the work a clustering actually
performs, which prevents ineffective clusterings that
provide no useful result from receiving high scores. These
concepts are defined and analysed using intrinsic and extrinsic
approaches to the evaluation of document clus- 
ter quality. This includes the classical clusters 
to categories approach and a novel approach that 
uses ad hoc information retrieval. The diver- 
gence from a random baseline approach is able to 
differentiate ineffective clusterings encountered 
in the INEX XML Mining track. It also appears 
to perform a normalisation similar to the Nor- 
malised Mutual Information (NMI) measure but 
it can be applied to any measure of cluster qual- 
ity. When it is applied to the intrinsic measure 
of distortion as measured by RMSE, subtraction 
from a random baseline provides a clear optimum 
that is not apparent otherwise. This approach can 
be applied to any clustering evaluation. This pa- 
per describes its use in the context of document 
clustering evaluation. 



1 Introduction 

This paper extends, motivates and analyses a document 
clustering evaluation approach that compensates for inef- 
fective document clusterings during evaluation. An inef- 
fective clustering is one that achieves a high score accord- 
ing to a measure of document cluster quality but provides 
no value as a clustering solution. Divergence from a ran- 
dom baseline is introduced and formally defined to address 
ineffective clusterings in evaluation. A notion of work 
performed by a clustering is introduced where ineffective 
cases appear to perform no useful learning. The paper is 
concluded with a detailed analysis of the results from the 
INEX 2010 XML Mining track. This paper clearly defines 
and motivates this approach with theoretical and experi- 
mental analysis. 

Ineffective document clusterings have been investigated 
using two extrinsic evaluations. The first is the standard 
clusters to categories approach where document clusters 
are compared to a ground truth set of category labels. The 
second approach evaluates document clustering using ad 
hoc information retrieval that has a use case for collec- 
tion selection where a document collection is distributed 
across many machines. A broker needs to direct a search
query to machines containing relevant documents. If the
documents are allocated to machines by document cluster,
it is expected that only a few topical clusters need to be
searched. This is motivated by the cluster hypothesis [20]
that states relevant documents tend to be more similar to 
each other than non-relevant documents. The Normalised 
Cumulative Cluster Gain (NCCG) measure evaluates doc- 
ument clustering with respect to this use case for ad hoc 
information retrieval. 

The paper proceeds as follows. Section 2 introduces
the collaborative XML document mining evaluation forum
at INEX. Section 3 introduces document clustering
in an information retrieval context and discusses different
approaches. Evaluation of document clustering using the
clusters to categories approach and ad hoc relevance
judgements is discussed in Section 4. Sections 5, 6 and 7
introduce and define ineffective clusterings that perform no
useful learning and can be adjusted for by applying divergence
from a random baseline. Section 8 analyses the application
of divergence from a random baseline using the INEX 2010
XML mining track. The paper is concluded in Section 9.

2 INEX XML Mining Track 

The XML document mining track was run for six years at
INEX, the Initiative for the Evaluation of XML Information
Retrieval. It explored the emerging field of classification
and clustering of semi-structured documents.

Document clustering has been evaluated at INEX using 
the standard clusters to categories approach, where cate- 
gories extracted from the Wikipedia were used as a ground 
truth. Clusterings produced by different systems were
evaluated using measures such as Purity, Entropy, F1 and NMI,
indicating how well the clusters match the categories.

A novel approach to document clustering evaluation was
introduced at INEX in 2009 [26] and 2010. It used
ad hoc information retrieval to evaluate document clustering
by using relevance judgments from retrieval systems in
the ad hoc track. Ad hoc information retrieval evaluation
is a system based approach that evaluates how different
systems rank relevant documents. For systems to be compared,
the same set of information needs and documents
have to be used. A test collection consists of documents,
statements of information need, and relevance judgments
[36]. Relevance judgments are often binary and any document
is considered relevant if any of its contents can contribute
to the satisfaction of the specified information need.
However, the ad hoc track at INEX provides additional
relevance information where assessors highlight the relevant
text in the documents. Information needs are also referred
to as topics and contain a textual description of the informa- 
tion need, including guidelines as to what may or may not 
be considered relevant. Typically, only the keyword based 
query of a topic is given to a retrieval system. 

The ad hoc information retrieval based evaluation of 
document clustering is motivated by the cluster hypothe- 
sis that suggests relevant documents are more similar to 
each other than non-relevant documents; relevant documents
tend to cluster together. The spread of relevant
documents over a clustering solution was measured using
the Normalised Cumulative Cluster Gain (NCCG) measure
in the INEX XML mining track in 2009 and 2010 [26].
This evaluation approach also has a specific use case
in information retrieval. It evaluates clustering of a doc- 
ument collection for collection selection. Collection se- 
lection involves selecting a subset of a collection given a 
query. Typically, these subsets are distributed on different 
machines. The goal is to cluster documents such that only 
a small fraction of clusters, and therefore machines, need 
to be searched to find most of the relevant documents for a 
given query. This leads to improved run time performance 
as only a fraction of the collection needs to be searched. 
The total load over a distributed system is decreased as only 
a few machines need to be searched per query instead of ev- 
ery machine. It also provides a clear use case for document 
clustering evaluation. By contrast, comparing document 
clusters to predefined categories only evaluates clustering 
as a match against a particular classification. 

This paper uses the INEX 2010 XML Mining track
dataset [8]. It is a 146,225 document subset of the INEX
XML Wikipedia collection determined by the reference run
used for the ad hoc track. The reference run contains
the 1500 highest ranked documents for each of the queries
in the ad hoc track. The queries were searched using an
implementation of Okapi BM25 in the ATIRE [35] search
engine.

Topical categories for documents are one of many views 
of extrinsic cluster quality. They are derived from what hu- 
mans perceive as topics in a document collection. When 
categories are used for evaluation, a document clustering 
system is given a score indicating how well the clusters 
match the predefined categories. This is the most preva- 
lent approach to evaluation of document clustering in the 
research literature. 

The categories for the INEX 2010 XML Mining col- 
lection were extracted from the Wikipedia category graph 
which is noisy and nonsensical at times. Therefore, an ap- 
proach using shortest paths in the graph was used to extract 
36 categories [18].

3 Document Clustering 

Document clustering is used in many different contexts,
such as exploration of structure in a document collection
for knowledge discovery [33], dimensionality reduction for
other tasks such as classification [22], clustering of search
results for an alternative presentation to the ranked list [19]
and pseudo-relevance feedback in retrieval systems [23].

Recently there has been a trend towards exploiting
semi-structured documents [27]. This uses features such as
the XML tree structure and hyper-link graphs to derive data
from documents to improve the quality of clustering.

Document clustering groups documents into topics with- 
out any knowledge of the category structure that exists in 
a document collection. All semantic information is derived
from the documents themselves. It is often referred to as
unsupervised clustering. In contrast, document classifica- 
tion is concerned with the allocation of documents to prede- 
fined categories where there are labeled examples to learn 
from. Clustering for classification is referred to as super- 
vised learning where a classifier is learned from labeled 
examples and used to predict the classes of unseen docu- 
ments. 

The goal of clustering is to find structure in data to form
groups. As a result, there are many different models, learning
algorithms, encodings of documents and similarity measures.
Many of these choices lead to different induction
principles [14] which result in discovery of different clusters.
An induction principle is an intuitive notion as to what
constitutes groups in data. For example, algorithms such as
k-means [24] and Expectation Maximisation [9] use a
representative based approach to clustering where a prototype
is found for each cluster. These prototypes are referred to
as means, centers, centroids, medians and medoids [14]. A
similarity measure is used to compare the representatives
to examples being clustered. These choices determine the
clusters discovered by a particular approach.

A popular model for learning with documents is the Vector
Space Model (VSM) [30]. Each dimension in the vector
space is associated with one term in the collection. Term
frequency statistics are collected by parsing the document
collection and counting how many times each term appears
in each document. This is supported by the distributional
hypothesis [18] from linguistics that theorises that words
that occur in the same context tend to have similar meanings.
If two documents use a similar vocabulary and have
similar term frequency statistics then they are likely to be
topically related. The end result is a high dimensional,
sparse document-by-term matrix whose properties can be
explained by Zipf distributions [41] in term occurrence.
The matrix represents a document collection where each 
row is a document and each column is a term in the vocab- 
ulary. In the clustering process, document vectors are of- 
ten compared using the cosine similarity measure. The co- 
sine similarity measure has two properties that make it use- 
ful for comparing documents. Document vectors are nor- 
malised to unit length when they are compared. This nor- 
malisation is important since it accounts for the higher term 
frequencies that are expected in longer documents. The in- 
ner product that is used in computing the cosine similarity 
has non-zero contributions only from words that occur in 
both documents. Furthermore, sparse document represen- 
tation allows for efficient computation. 
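
The two properties of cosine similarity described above can be illustrated with a minimal Python sketch. It assumes a simple whitespace tokeniser and raw term counts stored in sparse dictionaries; the function names are illustrative and do not come from any system used in this paper.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each lower-cased, whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms that occur in both documents contribute to the inner product.
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    # Dividing by both vector lengths normalises for document length.
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


short_doc = term_frequencies("clustering groups documents")
long_doc = term_frequencies("clustering groups documents " * 10)
other_doc = term_frequencies("football match results")

# Same vocabulary at ten times the length: similarity is (numerically) one.
print(cosine_similarity(short_doc, long_doc))
# No shared terms: similarity is zero.
print(cosine_similarity(short_doc, other_doc))
```

Because only the shared terms are visited, the cost of a comparison depends on the number of distinct terms in the documents rather than on the size of the vocabulary.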

Different approaches exist to weight the term frequency 
statistics contained in the document-by-term matrix. The 
goal of this weighting is to take into account the relative
importance of different terms, and thereby facilitate improved
performance in common tasks such as classification,
clustering and ad hoc retrieval. Two popular approaches are
TF-IDF [29] and BM25 [38].
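
As an illustration of term weighting, the sketch below applies one common TF-IDF variant, raw term frequency multiplied by log(N / document frequency). This particular formula is an assumption chosen for brevity; it is not necessarily the variant used in the cited work or at INEX, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Weight term counts by log(N / df) for a list of tokenised documents."""
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    # Document frequency: the number of documents each term appears in.
    document_frequency = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / document_frequency[term])
         for term, tf in c.items()}
        for c in counts
    ]


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "cat"]]
weights = tf_idf(docs)
# "the" occurs in every document, so its weight is zero everywhere.
print(weights[0])
```

Terms that appear in every document carry no weight, while rarer terms are emphasised in proportion to how often they occur within a document.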

Clustering algorithms can be characterized by two prop- 
erties. The first determines if cluster membership is dis- 
crete. Hard clustering algorithms only assign each docu- 
ment to one cluster. Soft clustering algorithms assign doc- 
uments to one or more clusters in varying degree of mem- 
bership. The second determines the structure of the clusters 
found as being either flat or hierarchical. Flat clustering 
algorithms produce a fixed number of clusters with no re- 
lationships between the clusters. Hierarchical approaches 
produce a tree of clusters, starting with the broadest level
clusters at the root and the narrowest at the leaves.

K-means [24] is one of the most popular learning algorithms
for use with document clustering and other clustering
problems. It has been reported as one of the top 10
algorithms in data mining [39]. Despite research into many
other clustering algorithms it is often the primary choice
for practitioners due to its simplicity [17] and quick
convergence [1]. Hierarchical clustering approaches such as
repeated bisecting k-means [32], K-tree and agglomerative
hierarchical clustering [32] have also been used. Further
methods such as graph partitioning algorithms [21],
matrix factorisation [40], topic modeling [5] and Gaussian
mixture models [9] have also been used.

The k-means algorithm [24] uses the vector space model
by iteratively optimising k centroid vectors which represent
clusters. These centroids are updated by taking the mean of
the nearest neighbours of the centroid. The algorithm
proceeds to iteratively optimise the sum of squared distances
between the centroids and the set of vectors that they are
nearest neighbours to (clusters). This is achieved by
iteratively updating the centroids to the cluster means and
reassigning nearest neighbours to form new clusters, until
convergence. The centroids are initialized by selecting k
vectors from the document collection uniformly at random.
It is well known that k-means is a special case of Expectation
Maximisation [9] with hard cluster membership and
isotropic Gaussian distributions.
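
The iteration described above can be sketched in a few lines of Python. This is a minimal illustration using dense lists and squared Euclidean distance, not the implementation used by any INEX participant; the `seed` and `max_iterations` parameters and the handling of empty clusters are assumptions added to make the example deterministic and safe to run.

```python
import random


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, k, max_iterations=100, seed=0):
    """Return (assignment, centroids) after iterating to convergence."""
    rng = random.Random(seed)
    # Initialise the centroids with k vectors chosen uniformly at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each vector joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # No vector changed cluster, so the algorithm has converged.
        assignment = new_assignment
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:  # An empty cluster keeps its previous centroid.
                centroids[j] = mean_vector(members)
    return assignment, centroids


points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, centroids = k_means(points, k=2)
print(assignment)
```

Capping the number of iterations, as most practitioners do, bounds the running time regardless of how slowly a particular data set converges.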

The k-means algorithm has been shown to converge in
a finite amount of time [31] as each iteration of the algorithm
visits a possible permutation without revisiting the
same permutation twice, leading to a worst case analysis
of exponential time. Arthur et al. [1] have performed a
smoothed analysis to explain the quick convergence of k-means
theoretically. This is the same analysis that has been
applied to the simplex algorithm, which has an exponential worst
case complexity but usually converges in linear time on
real data. While there are point sets that can force k-means
to visit every permutation, they rarely appear in practical
data. Furthermore, most practitioners limit the number of
iterations k-means can run for, which results in linear time
complexity for the algorithm. While the original proof of
convergence applies to k-means using squared Euclidean
distance [31], newer results show that other similarity measures
from the Bregman divergence class of measures can
be used with the same complexity guarantees [3]. This includes
similarity measures such as KL-divergence, logistic
loss, Mahalanobis distance and Itakura-Saito distance.
Ding and He [13] demonstrate the relationship between k-means
and Principal Component Analysis (PCA). PCA is usually
thought of as a matrix factorisation approach for dimensionality
reduction whereas k-means is considered a clustering
algorithm. It is shown that PCA provides a solution
to the relaxed k-means problem, thus formally creating a
link between k-means and matrix factorisation methods.

4 Document Clustering Evaluation 

Evaluating document clustering is a difficult task. Intrin- 
sic or internal measures of quality such as distortion or log 
likelihood only indicate how well an algorithm optimised a 
particular representation. Intrinsic comparisons are inher- 
ently limited by the given representation and are not com- 
parable between different representations. Extrinsic or ex- 
ternal measures of quality compare a clustering to an ex- 
ternal knowledge source such as a ground truth labeling 
of the collection or ad hoc relevance judgments. This allows
comparison between different approaches. Extrinsic
views of truth are created by humans and suffer from the
tendency for humans to interpret document topics differently.
Whether a document belongs to a particular topic or
not can be subjective. To further complicate the problem
there are many valid ways to cluster a document collection.
It has been noted that clustering is ultimately in the eye of
the beholder.

When comparing a cluster solution to a labeled ground
truth, the standard measures of Purity, Entropy, NMI and
F1 are often used to determine the quality of clusters with
regard to the categories. Let $\omega = \{w_1, w_2, \ldots, w_k\}$
be the set of clusters for the document collection $D$ and
$\xi = \{c_1, c_2, \ldots, c_j\}$ be the set of categories. Each
cluster and category is a subset of the document collection,
$\forall c \in \xi, w \in \omega : c, w \subset D$. Purity assigns a score based on
the fraction of a cluster that is the majority category label.
It falls in the interval [0, 1] where 0 is absence of purity and 1 is
total purity. Entropy defines a probability for each category
and combines them to represent order within a cluster.
It falls in the interval [0, 1] where 0 is total order and
1 is complete disorder. F1 identifies a true positive (tp) as
two documents of the same category in the same cluster, a
true negative (tn) as two documents of different categories
in different clusters and a false negative (fn) as two documents
of the same category in different clusters, and
combines these classification judgements using the
harmonic mean.
The Purity, Entropy and F1 scores assign a score to each
cluster which can be micro or macro averaged across all the
clusters. The micro average weights each cluster by its size,
giving each document in the collection equal importance in
the final score. The macro average is simply the arithmetic
mean, ignoring the size of the clusters. NMI makes a trade-off
between the number of clusters and quality in an information
theoretic sense. For a detailed explanation of these
measures please consult Manning et al. [25].
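
A minimal Python sketch of per-cluster Purity and Entropy with micro and macro averaging is given below. Each cluster is represented as the list of category labels of its documents. Dividing Entropy by the logarithm of the number of categories is an assumption made here so that the score falls in [0, 1]; the function names are illustrative only.

```python
import math
from collections import Counter


def purity(cluster_labels):
    """Fraction of the cluster that carries its majority category label."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)


def entropy(cluster_labels, n_categories):
    """Label entropy of a cluster, scaled by log(n_categories) (assumed)."""
    if n_categories < 2:
        return 0.0
    total = len(cluster_labels)
    counts = Counter(cluster_labels)
    h = sum(-(c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(n_categories)


def micro_average(score, clusters, *args):
    """Weight each cluster's score by its size."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) * score(c, *args) for c in clusters) / total


def macro_average(score, clusters, *args):
    """Arithmetic mean of the cluster scores, ignoring cluster size."""
    return sum(score(c, *args) for c in clusters) / len(clusters)


clusters = [["a", "a", "a", "b"], ["b", "b"]]
print(micro_average(purity, clusters))      # size-weighted mean of 0.75 and 1.0
print(macro_average(purity, clusters))      # plain mean of 0.75 and 1.0
print(micro_average(entropy, clusters, 2))
```

The micro average gives the larger, less pure cluster more influence, so it is lower than the macro average in this example.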

4.1 NCCG 

The NCCG evaluation measure has been used for the evaluation
of document clustering at INEX [26]. It is motivated
by van Rijsbergen's cluster hypothesis [20]. If the
hypothesis holds true, then relevant documents will appear
in a small number of clusters. A document clustering solu- 
tion can be evaluated by measuring the spread of relevant 
documents for the given set of queries. 

NCCG is calculated using manual result assessments 
from ad hoc retrieval evaluation. Evaluations of ad hoc
retrieval occur in forums such as INEX, CLEF [15]
and TREC [6]. The manual query assessments are called
the relevance judgments and have been used to evaluate ad
hoc retrieval of documents. The process involves defining
a query based on the information need, a retrieval system
returning results for the query and humans judging whether
the results returned by a system are relevant to the information
need.

The NCCG measure tests a clustering solution to deter- 
mine the quality of clusters relative to the optimal collec- 
tion selection. Collection selection involves splitting a col- 
lection into subsets and recommending which subsets need 
to be searched for a given query. This allows a retrieval sys- 
tem to search fewer documents, resulting in improved run- 
time performance over searching the entire collection. The 
NCCG measure has complete knowledge of which documents
are relevant to queries and orders clusters in descending
order by the number of relevant documents they contain.
We call this measure an "oracle" because it has complete 
knowledge of relevant documents. A working retrieval sys- 
tem does not have this property, so this measure represents 
an upper bound on collection selection performance. 

Better clustering solutions in this context will tend to 
group together relevant results for previously unseen ad hoc 
queries. Real ad hoc retrieval queries and their manual as- 
sessment results are utilised in this evaluation. This ap- 
proach evaluates the clustering solutions relative to a very 
specific objective - clustering a large document collection 
in an optimal manner in order to satisfy queries while
minimising the search space. The measure used for evaluating
the collection selection is called Normalised Cumulative
Cluster Gain (NCCG) [26].
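
As a rough illustration, NCCG for a single topic can be sketched in Python from the number of relevant documents found in each cluster. This is not the official INEX implementation. It assumes that the summed cumulative gain is divided by the square of n_r, the number of relevant documents for the topic, and that the sorted gain vector is padded with zeros (or trimmed of trailing zeros) to n_r entries so that a perfect split scores 1. The function names are hypothetical.

```python
def cumulative_sum_total(gains):
    """Sum of all elements of the cumulative sum of a gain vector."""
    running, total = 0, 0
    for g in gains:
        running += g
        total += running
    return total


def nccg(relevant_per_cluster):
    """NCCG for one topic, given the relevant-document count of each cluster."""
    n_r = sum(relevant_per_cluster)
    if n_r < 2:
        raise ValueError("NCCG needs at least two relevant documents")
    # Oracle ordering: clusters sorted by descending number of relevant documents.
    # Assumption: the vector is padded or trimmed (zeros only) to n_r entries.
    cg = sorted(relevant_per_cluster, reverse=True)[:n_r]
    cg += [0] * (n_r - len(cg))
    # Worst possible split: one relevant document in each cluster.
    cg1 = [1] * n_r
    split_score = cumulative_sum_total(cg) / n_r ** 2
    min_split_score = cumulative_sum_total(cg1) / n_r ** 2
    return (split_score - min_split_score) / (1 - min_split_score)


print(nccg([4, 0, 0]))     # all relevant documents in one cluster
print(nccg([2, 2, 0]))     # relevant documents spread over two clusters
print(nccg([1, 1, 1, 1]))  # one relevant document per cluster
```

Under these assumptions the score ranges from 0 for the worst split to 1 when every relevant document falls in a single cluster. The per-topic scores would then be averaged over all topics.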

The Cumulative Gain of a Cluster (CCG) is defined by
the number of relevant documents in a cluster,
$CCG(c, t) = \sum_{i=1}^{n} Rel_i$. A sorted vector $CG$ is created for a clustering
solution, $c$, and a topic, $t$, where each element represents
the CCG of a cluster. It is normalised by the ideal gain
vector,

$$SplitScore(t, c) = \frac{\sum_{i=1}^{|CG|} cumsum(CG)_i}{n_r^2}, \qquad (4)$$

where $n_r$ is the total number of relevant documents for the
topic, $t$. The worst possible split places one relevant document
in each cluster, represented by the vector $CG1$,

$$MinSplitScore(t, c) = \frac{\sum_{i=1}^{|CG1|} cumsum(CG1)_i}{n_r^2}. \qquad (5)$$

NCCG is calculated using the previous functions,

$$NCCG(t, c) = \frac{SplitScore(t, c) - MinSplitScore(t, c)}{1 - MinSplitScore(t, c)}. \qquad (6)$$

It is then averaged across all topics.

4.2 Single and Multi Label Evaluation

Both the clustering approaches and the ground truth can
be single or multi label. Examples of algorithms that produce
multi label clusterings are soft or fuzzy approaches
such as fuzzy c-means [4], Latent Dirichlet Allocation
or Expectation Maximisation [9]. A ground truth is multi
label if it allows more than one category label for each document.
Any combination of single or multi label clusterings
or ground truths can be used for evaluation.
However, it is only reasonable to compare approaches using
the same combination of single or multi label clustering
and ground truths. Multi label approaches are less restrictive
than single label approaches as documents can exist in
more than one category. There is redundancy in the data
whether it is a clustering or a ground truth. This redundancy
has real, physical costs when clustering is used for collection
selection. More storage and compute resources are
required with a multi label clustering as one document has
to be stored and processed on more than one computer. A
ground truth can be considered a clustering and compared
to another ground truth to measure how well the ground
truths fit each other. Furthermore, a ground truth can be
used as a clustering solution and used for collection selection.

The evaluation of document clustering using ad hoc information
retrieval can be viewed as being similar to an
evaluation using a multi label category based ground truth.
A document can be relevant to more than one query. However,
unlike a category based approach, each query is evaluated
separately and then averaged across all queries. In
contrast, all categories are evaluated at once and the score
is not averaged across categories.

5 Ineffective Clustering

In this paper we introduce the concept of an ineffective
clustering. An ineffective clustering produces a high score
according to an evaluation measure but does not represent
any inherent value as a clustering solution.

The Purity evaluation measure has an obvious ineffective
case. If each cluster contains one document then it is 100%
pure with respect to the ground truth. A single document
is the majority of the cluster. As the goal of clustering is to
produce groups of documents or to summarise the collection,
this is obviously flawed as it does neither. The same
applies to the Entropy measure as the probability of a label
for a cluster is 100%, resulting in the best possible
Entropy score.

The NCCG measure is ineffective when one cluster contains
almost all of the documents and every other cluster contains
one document. The NCCG measure orders clusters
by the number of relevant documents they contain. A large
cluster containing most documents will almost always be
ranked first. Therefore, almost all relevant documents will
exist in one cluster, achieving almost the highest score possible.

6 Work Performed by a Clustering



To overcome ineffective clusterings in the previous section, 
we introduce the concept of work performed by a cluster- 
ing approach. Work is defined as an increase in quality of 
a clustering over a simple approach that ignores the docu- 
ments being clustered. A useful clustering performs work 
beyond an approach that is purely random and ignores doc- 
ument content. If a random approach that performs no use- 
ful learning performs equally to an approach that attempts 
to learn from that data, it would appear that nothing has 
been achieved by analysing the data. We suggest that an 
ineffective clustering performs no useful learning. This is 
supported by a theoretical and experimental analysis in the 
following sections. 

Figures 1 and 2 illustrate an approach using a clustering
algorithm and a random approach that ignores document 
content. The difference in cluster quality between these 
two approaches represents work completed by a clustering 
algorithm. 

7 Divergence from a Random Baseline 

Many measures of cluster quality can give high quality 
scores for particular clustering solutions that are not of high 
quality by changing the number of clusters or number of 
documents in each cluster. 

Figure 1: A Clustering Produced by a Clustering Algorithm

Figure 2: A Random Baseline Distributing Documents into
Buckets the Same Size as a Clustering



Measures that can be misled by creating an ineffective 
clustering can be adjusted by subtraction from a randomly 
generated clustering with the same number of clusters with 
the same number of documents in each cluster. Figures
1 and 2 highlight this example where the random baseline
distributes documents into buckets the same size as the
clusters found by the clustering algorithm. Apart from the 
random assignment of documents to clusters, the random 
baseline appears the same as the real solution. Therefore, 
each clustering evaluated requires a random baseline that is 
specific to that clustering. The baseline is created by shuf- 
fling the documents uniformly randomly and splitting them 
into clusters the same size as the clustering being measured. 
The score for the random baseline clustering is subtracted
from the score of the matching clustering being measured.
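
The baseline construction described above can be sketched in Python as follows. Clusters are represented by the category labels of their documents, so shuffling the labels is equivalent to shuffling the documents. Micro averaged Purity is used as the example measure; the function names and the use of a seeded `random.Random` instance are illustrative assumptions, and clusters are assumed to be non-empty.

```python
import random
from collections import Counter


def micro_purity(clusters):
    """Fraction of all documents that carry their cluster's majority label."""
    total = sum(len(c) for c in clusters)
    majority = sum(max(Counter(c).values()) for c in clusters)
    return majority / total


def random_baseline(clusters, rng):
    """Shuffle all documents and split them into clusters of the same sizes."""
    labels = [label for cluster in clusters for label in cluster]
    rng.shuffle(labels)
    baseline, start = [], 0
    for cluster in clusters:
        baseline.append(labels[start:start + len(cluster)])
        start += len(cluster)
    return baseline


def divergence_from_baseline(measure, clusters, rng):
    """Score of the clustering minus the score of its matching baseline."""
    return measure(clusters) - measure(random_baseline(clusters, rng))


rng = random.Random(42)
# Ineffective case: one document per cluster is perfectly pure, and so is
# its baseline, so the adjusted score is zero.
singletons = [["a"], ["b"], ["a"], ["b"]]
print(micro_purity(singletons), divergence_from_baseline(micro_purity, singletons, rng))
# A clustering that separates the categories scores well above its baseline.
grouped = [["a"] * 50, ["b"] * 50]
print(micro_purity(grouped), divergence_from_baseline(micro_purity, grouped, rng))
```

Both clusterings have a raw Purity of 1, but only the second retains a positive score once its own baseline has been subtracted.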

The divergence from a random baseline approach can be 
applied to any measure of cluster quality whether it is
intrinsic or extrinsic. However, it does require an existing
measure of cluster quality. It is not a measure by itself but 
an approach to ensure a clustering is doing something sen- 
sible. Although we have highlighted its use for document 
clustering evaluation, it can be used for any clustering eval- 
uation. 

There are two issues at play here. Firstly, different distri- 
butions of cluster sizes can lead to arbitrarily high scores. 
The second issue is determining if the clustering algorithm 
is effectively learning with respect to a measure of quality. 



The divergence from a random baseline takes care of inef- 
fective solutions in either case. If the internal ordering of 
clusters is no better than random noise then it achieves a 
score of zero. A negative score could be achieved as the 
random baseline scores a positive value using most mea- 
sures on most data sets. It is possible for a clustering to 
have a worse score than the baseline. For example, a clus- 
tering approach could maximise dissimilarity of documents 
in clusters. This will create a solution where the most dis- 
similar documents are placed together, resulting in a worse 
score than random assignment. The random assignment 
does not bias the clustering towards or away from the mea- 
sure of quality. If a clustering approach is in fact learning 
something with respect to the measure of quality, then it 
is expected that it will be biased towards it. Alternatively,
if we reverse the optimisation process, it should be biased 
away from it. 

Let $\omega = \{w_1, w_2, \ldots, w_k\}$ be the set of clusters for the
document collection $D$ and $\xi = \{c_1, c_2, \ldots, c_j\}$ be the set
of categories. Each cluster and category is a subset of the
document collection, $\forall c \in \xi, w \in \omega : c, w \subset D$. We
define the probability of a category in the baseline given
a cluster as, $P_b(c_j|w_k) = \frac{|c_j|}{|D|}$. The probability of a
category given a cluster in the baseline only depends on
the size of the categories. The baseline is a uniformly randomly
shuffled list of documents that has been split into
clusters that match the cluster size distribution in the solution
being evaluated. Thus, the content of each cluster in the
baseline is uniform random noise. It is not biased by the document
representation. So, it is expected categories will occur
at a rate proportional to the category's size. For example,
if there are three categories A, B, C containing 10, 20, 30
documents, each cluster in the baseline is expected to contain
approximately $\frac{1}{6}$ A, $\frac{2}{6}$ B and $\frac{3}{6}$ C. This only reflects
the size distribution of the categories.
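
The worked example above can be checked with exact fractions. This is a small illustrative sketch; the function name is hypothetical and the cluster size of 12 is chosen only to give whole numbers.

```python
from fractions import Fraction


def baseline_probabilities(category_sizes):
    """Expected share of each category in any baseline cluster: |c_j| / |D|."""
    collection_size = sum(category_sizes.values())
    return {c: Fraction(n, collection_size) for c, n in category_sizes.items()}


probabilities = baseline_probabilities({"A": 10, "B": 20, "C": 30})
print(probabilities)

# Expected number of documents from each category in a baseline cluster of 12.
expected_counts = {c: p * 12 for c, p in probabilities.items()}
print(expected_counts)
```

The shares depend only on the category sizes, so every baseline cluster is expected to have the same mix regardless of how large it is.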

We let any measure of cluster quality be interpreted as
a probability. Although this is not formally the case for all
measures, it serves as a reasonable explanation. We define
the probability of a category in a cluster given the ground
truth as, $P_s(c_j|w_k)$ = any measure of cluster quality.

The Purity measure assigns an actual probability to each
cluster when there is a single label ground truth. All the
probabilities combined accumulate to one,
$\sum_{c_j \in \xi} \frac{|c_j \cap w_k|}{|w_k|} = 1$,
and the category with the largest maximum likelihood
estimate is assigned to each cluster,
$P_{Purity}(c_j|w_k) = \max_{c_j} \frac{|c_j \cap w_k|}{|w_k|}$.
This is the proportion of the cluster
that has the majority category label. It also represents the
same process of using clustering for classification with labeled
data where an unseen sample is labeled based on the
majority category label of the cluster it is nearest neighbour
to. We define $d$ as a document in $D$. The ground
truth is restricted to being single label where a document,
$d$, has only one label in one category in the ground
truth, $\forall d \in D, \nexists c_i \in \xi, c_j \in \xi : d \in c_i \wedge d \in c_j \wedge c_i \neq c_j$.

The adjusted measure is the difference between the submission
and the baseline. We define the adjusted probability
of a category given a cluster as,
$P_a(c_j|w_k) = P_s(c_j|w_k) - P_b(c_j|w_k)$.

An alternative formal view of divergence from a random
baseline can be defined by a quality function,
$m : \mathcal{P}\mathcal{P}(\mathbb{Z} \times \mathbb{Z}) \to \mathbb{R}$,
that takes a set of clusters as a set of sets of (document,
category label) pairs, $s$, and returns a real number
indicating the quality of the clustering. Examples of these
cluster quality functions are Entropy, F1, NCCG,
Negentropy, NMI and Purity. There exists a function,
$r : \mathcal{P}\mathcal{P}(\mathbb{Z} \times \mathbb{Z}) \to \mathcal{P}\mathcal{P}(\mathbb{Z} \times \mathbb{Z})$,
that generates a random baseline, $b$, given a clustering
solution, $s$. The baseline has the same number of clusters
as the clustering solution, $|b| = |s|$. Corresponding
clusters in the original clustering, $s$, and the baseline,
$b$, contain the same number of documents,
$\forall k : |s_k| = |b_k|$. The adjusted measure,
$m_a : \mathcal{P}\mathcal{P}(\mathbb{Z} \times \mathbb{Z}) \to \mathbb{R}$,
becomes $m_a(s) = m(s) - m(r(s))$.
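The functional view above can be sketched in a few lines of Python. This is a minimal illustration and not the evaluation code used at INEX: it assumes the baseline is generated by shuffling the (document, label) pairs uniformly at random and refilling clusters of the original sizes, and it uses a size-weighted Purity as the example quality function. The names `random_baseline`, `adjusted` and the toy data are hypothetical.

```python
import random
from collections import Counter

def purity(clusters):
    """Example quality function m(s): size-weighted purity."""
    total = sum(len(cluster) for cluster in clusters)
    majority = sum(
        max(Counter(label for _, label in cluster).values())
        for cluster in clusters
    )
    return majority / total

def random_baseline(clusters, rng):
    """Assumed r(s): same number of clusters, same size for each cluster."""
    pairs = [pair for cluster in clusters for pair in cluster]
    rng.shuffle(pairs)  # break any relation between documents and clusters
    baseline, start = [], 0
    for cluster in clusters:
        baseline.append(pairs[start:start + len(cluster)])
        start += len(cluster)
    return baseline

def adjusted(measure, clusters, rng):
    """m_a(s) = m(s) - m(r(s))."""
    return measure(clusters) - measure(random_baseline(clusters, rng))

# Hypothetical data: 100 documents, two categories of 50 documents each.
docs = [(i, "x" if i < 50 else "y") for i in range(100)]
effective = [docs[:50], docs[50:]]        # clusters match the categories
singletons = [[pair] for pair in docs]    # one document per cluster

rng = random.Random(0)
effective_score = adjusted(purity, effective, rng)
singleton_score = adjusted(purity, singletons, rng)
```

Placing every document in its own cluster gives a perfect unadjusted Purity, but the baseline with the same cluster sizes scores the same, so the adjusted score is zero. The clustering that matches the categories keeps a positive adjusted score.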

8 Application at the INEX 2010 XML Mining Track

Participants were asked to submit multiple clustering
solutions containing approximately 50, 100, 200, 500 and
1000 clusters. The extracted ground truth contained 36
categories because only categories with more than 3000
documents were used. This choice was arbitrary, and the
cluster sizes were decided based on the number of documents
in the collection before the categories were extracted. The
number of categories in a document collection is
subjective. Therefore, a direct comparison of 36 clusters
with 36 categories is not necessary. Measuring how the
categories behave over multiple cluster sizes indicates the
quality of clusters and the trend can be visualised.

A legend for Figures 4 to 9 can be found in Figure 3. The
Structured Linked Vector Model (SLVM) incorporates document
structure, links and content. The k-star method [34] is an
iterative clustering method for grouping documents. The
TopSig approach [16] produces binary strings that represent
documents and uses a modified k-means algorithm that works
directly with this representation.



Figure 3: Legend (k-star; Random with Uniform Cluster Size;
Structured Linked Vector Model; TopSig 1024 bit k-means)

Figure 4: Purity

Figure 5: Purity Subtracted from a Random Baseline

Submissions using the k-star method at INEX 2010 [34]
contained several large clusters and many other small
clusters. This exposed a weakness in the NCCG measure, which
resulted in inappropriately high scores. When the score of a
random baseline with the same properties is subtracted,
these submissions perform no better than a randomly
generated solution. This can be clearly seen in Figures 6
and 7, where the k-star method changes drastically between
the original score and the score when subtracted from a
random baseline.

The NMI measure is almost unaffected by subtraction from a
random baseline, whereas other measures show a larger
difference. Figures 8 and 9 highlight this property on
submissions from INEX 2010. This suggests that the
normalisation we have proposed is similar to that of NMI but
is applicable to any measure of cluster quality, whether it
is intrinsic or extrinsic. Figures 4 and 5 demonstrate how
the difference between the adjusted and unadjusted measures
is larger for measures that are not normalised. Each line
represents a different document clustering system. The
bottommost line in each graph is a randomly generated
clustering submission where a category for a document is
selected uniformly at random from the set of categories.
Note that this random clustering in the figures differs from
the random baseline. Its cluster size distribution is
uniform, whereas a random baseline has a cluster size
distribution that is specific to the clustering being
evaluated. When the random clustering is compared to its
random baseline, the expected result is achieved, with a
score of zero for all cluster sizes. Note that without
adjusting the cluster size distribution, a random clustering
is not able to differentiate ineffective clusterings, as per
the NCCG metric in Figure 7. Subtracting the random
submission with uniform cluster sizes from the NCCG
submission does not reduce its score to zero, as can be seen
in Figure 6.

Figures 10 and 11 demonstrate the application of the
divergence from a random baseline approach to an intrinsic
measure. RMSE is the Root Mean Squared Error of the
clustering using the cosine similarity measure. The higher
the value, the better the clustering. A cosine similarity of
1 indicates the document and the cluster centre are
identical. A score of 0 indicates they are orthogonal and
therefore have no overlap in vocabulary. This experiment was
run on a randomly selected sample of 10,000 documents. The
k-means algorithm was used to produce clusterings with k
between 1 and 10,000. Subtraction from a random baseline
assigns a score of zero to the ineffective cases.
Furthermore, it provides a clear maximum for RMSE.
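The same adjustment can be sketched for an intrinsic measure. The example below is a rough illustration under stated assumptions: it interprets the distortion measure as the root mean square of the cosine similarity between each document vector and the centre of its cluster, and it builds the baseline by shuffling vectors into clusters of the original sizes. The exact formulation used in the experiment is not given here, so `rms_cosine`, `adjusted_rms_cosine` and the toy vectors are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of the vectors in one cluster."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def rms_cosine(clusters):
    """Assumed intrinsic measure: RMS of document-to-centre cosine similarity."""
    sims = [cosine(v, centroid(cluster)) for cluster in clusters for v in cluster]
    return math.sqrt(sum(s * s for s in sims) / len(sims))

def random_baseline(clusters, rng):
    """Shuffle the vectors and refill clusters of the original sizes."""
    vectors = [v for cluster in clusters for v in cluster]
    rng.shuffle(vectors)
    baseline, start = [], 0
    for cluster in clusters:
        baseline.append(vectors[start:start + len(cluster)])
        start += len(cluster)
    return baseline

def adjusted_rms_cosine(clusters, rng):
    return rms_cosine(clusters) - rms_cosine(random_baseline(clusters, rng))

# Hypothetical toy vectors: two well separated groups of 20 documents.
group_a = [[1.0, 0.01 * i] for i in range(1, 21)]
group_b = [[0.01 * i, 1.0] for i in range(1, 21)]

rng = random.Random(0)
separated = adjusted_rms_cosine([group_a, group_b], rng)
one_per_cluster = adjusted_rms_cosine([[v] for v in group_a + group_b], rng)
```

With one document per cluster every document coincides with its cluster centre, in both the clustering and its baseline, so the adjusted score is zero up to floating point error. The clustering that separates the two groups keeps a positive adjusted score.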

Figure 6: NCCG

Figure 7: NCCG Subtracted from a Random Baseline

Figure 9: NMI Subtracted from a Random Baseline

Figure 10: RMSE

Figure 11: RMSE Subtracted from a Random Baseline

9 Conclusion

In this paper we introduced problems encountered in the
evaluation of document clustering: the concept of
ineffective clustering and a notion of work. The divergence
from a random baseline approach deals with these corner
cases and increases the confidence that a clustering
approach is achieving meaningful learning with respect to
any view of cluster quality. It is also applicable to any
clustering evaluation but was only discussed in the context
of document clustering in this paper.

Divergence from a random baseline was formally defined and
analysed experimentally with both intrinsic and extrinsic
measures of cluster quality. Furthermore, this approach
appears to be performing a normalisation similar to that
performed by NMI. It also provides a clear optimum for
distortion as measured by RMSE.